ABSTRACT:

With the advance of network and data mining techniques, protecting the confidentiality of
sensitive information in a database has become a critical issue. Association analysis is a
powerful and popular tool for discovering relationships hidden in large data sets. The relationships
can be represented in the form of frequent itemsets or association rules. A rule is categorized as
sensitive if its disclosure risk is above a given threshold. Privacy-preserving data mining is an
important issue that arises in various domains, such as Web commerce, crime
investigation, health care, and customer consumption analysis.

The main approach to hiding sensitive frequent itemsets is to reduce the support of each given
sensitive itemset. This is done by modifying transactions or items in the database. However, the
modifications generate side effects, i.e., nonsensitive frequent itemsets that are falsely hidden (the loss
itemsets) and spurious frequent itemsets that are falsely generated (the new itemsets). There is a trade-off
between the sensitive frequent itemsets hidden and the side effects generated. Furthermore, solving
the problem usually requires a large amount of computing time.

In this study, we propose a novel algorithm, FHSFI, for fast hiding of sensitive frequent
itemsets (SFI). FHSFI achieves the following goals: 1) all SFI can be completely hidden
without generating all frequent itemsets; 2) limited side effects are generated; 3) any minimum
support threshold is allowed; and 4) only one database scan is required.



I. INTRODUCTION 

Data mining technologies have become important for discovering previously unknown
and potentially useful information from large data sets or databases. They can be applied to various domains,
such as Web commerce, crime investigation, health care, and customer consumption analysis. Although these
technologies are useful, they also pose a threat to data privacy. For example, association rule analysis is a
powerful and popular tool for discovering relationships hidden in large data sets, so some private
information could easily be discovered by this kind of tool. Protecting the confidentiality of sensitive
information in a database has therefore become a critical issue.

The relationships discovered from a database can be represented in the form of frequent itemsets or
association rules. A rule is categorized as sensitive if its disclosure risk is above a given threshold. With
an association analyzer, an itemset whose support is not below a given minimum support is called a
frequent itemset.

The problem of finding an optimal sanitization of a source database with respect to association rule analysis has
been proven to be NP-hard [1]. In [2,3,4,5], the authors presented different heuristic algorithms that modify
transactions by inserting or deleting items in order to hide sensitive rules or itemsets.

Vassilios S. Verykios et al. [2] presented algorithms to hide sensitive association rules, but they
generate high side effects and require multiple database scans. Instead of hiding sensitive association rules,
Shyue-Liang Wang [3] proposed algorithms to hide sensitive items. These algorithms need fewer
database scans, but the side effects generated are higher. Ali Amiri [4] also presented heuristic algorithms to hide
sensitive items. Finally, Yi-Hung Wu et al. [5] proposed a heuristic method that can hide sensitive association
rules with limited side effects. However, it spends a lot of time comparing and checking whether the sensitive rules
are hidden and whether side effects are produced. Besides, it can fail to hide some sensitive rules in some cases.

In this study, we propose a novel algorithm, FHSFI, for fast hiding of sensitive frequent itemsets (SFI).
FHSFI achieves the following goals: 1) all SFI can be completely hidden without generating all
frequent itemsets; 2) limited side effects are generated; 3) any minimum support threshold is allowed; and 4)
only one database scan is required.

The remainder of this paper is organized as follows. Section 2 presents the problem formulation and
notations. Section 3 introduces the proposed algorithm for fast hiding of sensitive frequent
itemsets and gives examples to illustrate it. Section 4 reports the experimental results, which
show the performance and the side effects of the proposed algorithm. Section 5 presents the conclusions and
further work.

II. PROBLEM FORMULATION AND NOTATIONS 

In Table 1, we summarize the notations used hereafter in this paper. Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of
items in a transaction database $D = \{t_1, t_2, \ldots, t_n\}$, where every transaction $t_i$ is a subset of $I$, i.e.,
$t_i \subseteq I$. An example database is shown in Table 2; it contains nine items, $|I| = 9$, and five transactions,
$|D| = 5$. Let $X$ be a set of items in $I$. If $X \subseteq t_i$, we say that the transaction $t_i$ supports $X$. The support
of itemset $X$ can be computed by equation (1).

Table 1. Definitions of variables used in this paper

Variable        Definition
--------------  ------------------------------------------------------------
D               the original database
D'              the released database, which is transformed from D
U               the set of frequent itemsets generated from D
U'              the set of frequent itemsets generated from D'
$t_i$           a transaction in database D
$|t_i|$         the number of items in $t_i$
TID             a unique identifier of each transaction
SFI             the set of sensitive frequent itemsets to be hidden
$SFI.t_j$       a sensitive frequent itemset in SFI
$\|\cdot\|$     the support count of an itemset, i.e., the number of
                transactions that support the itemset
$w_i$           prior weight of $t_i$
PWT             a table storing TID and $w_i$ for each transaction,
                in decreasing order of $w_i$
$MIC_i$         the maximal number of itemsets in SFI that contain
                an item $i_k$, where $i_k \in t_i$ and $SFI.t_j \subseteq t_i$
SFI.t.i         transaction to be modified

An association rule is an implication of the form $X \rightarrow Y$, where $X \subset I$, $Y \subset I$, and
$X \cap Y = \emptyset$. A rule $X \rightarrow Y$ will be extracted from a database if

1) $\mathrm{support}(X \cup Y) \geq$ min_support (a given minimum support threshold) and

2) $\mathrm{confidence}(X \cup Y) \geq$ min_confidence (a given minimum confidence threshold),

where $\mathrm{support}(X \cup Y)$ and $\mathrm{confidence}(X \cup Y)$ are given by equations (2) and (3), respectively.

$$\mathrm{support}(X) = \frac{\|X\|}{|D|} \tag{1}$$

$$\mathrm{support}(X \cup Y) = \frac{\|X \cup Y\|}{|D|} \tag{2}$$

$$\mathrm{confidence}(X \cup Y) = \frac{\|X \cup Y\|}{\|X\|} \tag{3}$$



Table 2. Database D

TID   Transaction
---   -----------
1     1,2,4,5,7
2     1,4,5,7
3     1,4,6,7,8
4     1,2,5,9
5     6,7,8



Table 3. Frequent Itemsets

Itemset   Support
-------   -------
1         80%
4         60%
5         60%
7         80%
1,4       60%
1,5       60%
1,7       60%
1,4,7     60%
4,7       60%



In equation (1), $\|X\|$ denotes the number of transactions in the database that contain the itemset $X$, and
$|D|$ denotes the number of transactions in the database D. If $\mathrm{support}(X) \geq$ min_support, we call $X$ a
frequent itemset. Table 3 shows the frequent itemsets for a given min_support = 60%.
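
As a cross-check of Table 3, the following minimal sketch enumerates the frequent itemsets of the example database by brute force. The function names `support_count` and `frequent_itemsets` are illustrative and do not come from the paper, and the exhaustive enumeration is only meant for toy data (FHSFI itself avoids generating all frequent itemsets). The threshold is treated as inclusive, which is an assumption consistent with Table 3.

```python
from itertools import combinations

# Example database D from Table 2 (TID -> set of items).
D = {
    1: {1, 2, 4, 5, 7},
    2: {1, 4, 5, 7},
    3: {1, 4, 6, 7, 8},
    4: {1, 2, 5, 9},
    5: {6, 7, 8},
}


def support_count(itemset, db):
    """Return ||X||: the number of transactions containing every item of X."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)


def frequent_itemsets(db, min_support_pct):
    """Brute-force enumeration of itemsets with support >= min_support_pct.

    Exponential in the number of items, so only suitable for toy examples.
    """
    all_items = sorted(set().union(*db.values()))
    frequent = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            count = support_count(candidate, db)
            # Integer comparison avoids floating-point rounding at the threshold.
            if count * 100 >= min_support_pct * len(db):
                # Support in percent (rounded down if not an integer).
                frequent[candidate] = count * 100 // len(db)
    return frequent


table_3 = frequent_itemsets(D, 60)
for itemset, pct in table_3.items():
    print(itemset, f"{pct}%")
```

With min_support = 60% this yields the nine itemsets listed in Table 3.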

For the example $X = \{1,4,7\}$, since $X \subseteq t_1$, $X \subseteq t_2$, and $X \subseteq t_3$, we obtain $\|X\| = 3$. Therefore,
support(1,4,7) = 60%. Using the form $X \rightarrow Y$ (support, confidence) for association rules, the rules generated from
the above itemset {1,4,7} can be described as 1→4,7 (60%, 75%), 4→1,7 (60%, 100%), 7→1,4 (60%, 75%),
1,4→7 (60%, 100%), 1,7→4 (60%, 100%), and 4,7→1 (60%, 100%).
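
The six rules above can be reproduced with a short sketch of equations (2) and (3). The helper names `support_count` and `rules_from` are hypothetical, and exact fractions are used instead of percentages so that no rounding is involved. The sketch assumes the given itemset is supported by at least one transaction, so that no division by zero occurs.

```python
from fractions import Fraction
from itertools import combinations

# Example database D from Table 2, in TID order.
D = [
    {1, 2, 4, 5, 7},
    {1, 4, 5, 7},
    {1, 4, 6, 7, 8},
    {1, 2, 5, 9},
    {6, 7, 8},
]


def support_count(itemset, db):
    """Return ||X||: the number of transactions containing every item of X."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)


def rules_from(itemset, db):
    """Yield (X, Y, support, confidence) for every split of itemset into X -> Y."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, db)
    for size in range(1, len(itemset)):
        for x in combinations(sorted(itemset), size):
            y = sorted(itemset - set(x))
            support = Fraction(whole, len(db))                # equation (2)
            confidence = Fraction(whole, support_count(x, db))  # equation (3)
            yield list(x), y, support, confidence


rules = list(rules_from({1, 4, 7}, D))
for x, y, support, confidence in rules:
    print(f"{x} -> {y}: support {float(support):.0%}, confidence {float(confidence):.0%}")
```

Every rule has support 3/5; only the two rules whose antecedent is a single item with support count 4 (items 1 and 7) have confidence 3/4, and the other four have confidence 1.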

Figure 1 shows the relationships among the sets U, U', and SFI. The goal of this study is to hide all SFI and
to minimize the loss itemsets. That is, $U' \cap SFI = \emptyset$ and the set $U - U' - SFI$ should be minimized.
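
These set relationships map directly onto set operations. The sketch below is illustrative only: the function name `side_effects` is hypothetical, U is taken from Table 3, and U' was computed by hand for a made-up sanitization (deleting item 4 from transaction 1 at min_support = 60%, with {1,4,7} as the only sensitive itemset), which is not the output of FHSFI.

```python
def side_effects(U, U_prime, SFI):
    """Classify the outcome of a sanitization in terms of U, U', and SFI."""
    U, U_prime, SFI = set(U), set(U_prime), set(SFI)
    return {
        # Sensitive itemsets that are still frequent in D' (should be empty).
        "hiding_failures": U_prime & SFI,
        # Nonsensitive frequent itemsets that were falsely hidden.
        "loss_itemsets": U - U_prime - SFI,
        # Spurious frequent itemsets that were falsely generated.
        "new_itemsets": U_prime - U,
    }


# U: the frequent itemsets of Table 3 (min_support = 60%).
U = {
    frozenset(s)
    for s in [{1}, {4}, {5}, {7}, {1, 4}, {1, 5}, {1, 7}, {4, 7}, {1, 4, 7}]
}
# Hypothetical sensitive set and hand-computed U' for the made-up sanitization.
SFI = {frozenset({1, 4, 7})}
U_prime = {frozenset(s) for s in [{1}, {5}, {7}, {1, 5}, {1, 7}]}

result = side_effects(U, U_prime, SFI)
print(sorted(sorted(s) for s in result["loss_itemsets"]))
```

In this made-up case the sensitive itemset is hidden, but the itemsets {4}, {1,4}, and {4,7} are lost as a side effect, which is the quantity the study aims to keep small.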




Figure 1. The relationships among the sets, U, U', and SFI 

III. THE PROPOSED ALGORITHM 

We now present the FHSFI algorithm. Given D, SFI, and min_support, the algorithm
generates a database to be released, D', in which the sensitive frequent itemsets are hidden and the side effects
generated are minimized.

A sketch of the FHSFI algorithm is shown in Figure 2; it proceeds in the following stages.

Stage 2 modifies transactions one by one until all SFI have been hidden. The order in which
transactions are modified follows the prior weight associated with each transaction. The following tasks are
repeated until SFI is empty.

• Select a transaction $t_k$ from PWT such that $w_k$ is maximal.

• Select the item to be deleted, according to the heuristic shown in Figure 4, and delete it.

• Recompute $w_k$ after the modification, and then reinsert $t_k$ into the PWT so that the order is maintained.

• Subtract 1 from $\|SFI.t_j\|$ if $SFI.t_j$ contains the deleted item and was supported by $t_k$.

• Remove $SFI.t_j$ from SFI if $\|SFI.t_j\| / |D| <$ min_support.




Figure 3. The correlation between ti and SFI 

        ...
        If ||SFI.t_j|| / |D| < min_support then
            Remove SFI.t_j from SFI;
        End;
    End;

Figure 2. The pseudo code of the FHSFI algorithm

In stage 1, FHSFI scans the database once and collects, for each transaction, all useful information about its
correlation with SFI, including $\|SFI.t_j\|$ and $w_i$. $\|SFI.t_j\|$ is used for checking whether $SFI.t_j$ has been
hidden. $w_i$ is the prior weight of a transaction $t_i$, which provides a heuristic for estimating side
effects and can be computed by equation (4).

$$w_i = \frac{1}{2^{(|t_i| - 1)} / MIC_i} = \frac{MIC_i}{2^{(|t_i| - 1)}} \tag{4}$$

Table 4. An example of sensitive frequent itemsets, SFI

ID   Itemset
--   -------
1    1,2,5
2    1,4,7
3    1,5,7
4    6,8

Table 5. The support count for each itemset in SFI

ID   Itemset   $\|\cdot\|$
--   -------   -----------
1    1,2,5     2
2    1,4,7     3
3    1,5,7     2
4    6,8       2



Table 4 shows an example of sensitive frequent itemsets. Let $t_1 = \{1,2,4,5,7\}$, which supports $SFI.t_1$,
$SFI.t_2$, and $SFI.t_3$. As shown in Figure 3, the correlation between $t_1$ and SFI can be represented by a graph
$G = \langle V, E \rangle$. Each node stands for an item $i_k$ in $t_1$; the weight associated with each edge in E denotes the number of
itemsets in SFI that contain both of the adjacent nodes connected by the edge. Each node can be represented as
$\langle \{SFI.t_j \mid SFI.t_j \subseteq t_i,\ i_k \in SFI.t_j\},\ \mathit{item\_count}_{SFI.t} \rangle$. For example, the node $\langle \{1,2,3\}, 3 \rangle$ for item '1' indicates that
three itemsets in SFI contain the item '1', namely $SFI.t_1$, $SFI.t_2$, and $SFI.t_3$. As shown in Figure 3, item
'1' has the maximum item_count, which is equal to 3. Hence, we obtain $MIC_1 = 3$ and $w_1 = 3/16$.

Figure 4 shows the heuristic procedure for determining which item is to be modified and for computing MIC for
transaction $t_i$.

Heuristic();
Input: TID, SFI;
Output: the item to be modified, MIC_i;
Begin
    For each SFI.t_j in SFI do
    Begin
        If the transaction t_TID fully supports SFI.t_j then
        Begin
            For each item SFI.t_j.i in SFI.t_j do
                item_count(SFI.t_j.i) := item_count(SFI.t_j.i) + 1;
        End;
    End;
    Select the SFI.t_j.i with maximum item_count as the item of t_TID to be modified;
    Return(SFI.t_j.i, item_count);
End;

Figure 4. The pseudo code of the heuristic procedure
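
For readers who prefer executable code, the heuristic of Figure 4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `heuristic` and the dictionary representation of SFI are choices made here, and ties between items with the same count are broken by taking the smallest item, which the paper does not specify.

```python
from collections import Counter

# Sensitive frequent itemsets from Table 4 (ID -> itemset).
SFI = {
    1: frozenset({1, 2, 5}),
    2: frozenset({1, 4, 7}),
    3: frozenset({1, 5, 7}),
    4: frozenset({6, 8}),
}


def heuristic(transaction, sfi):
    """Return (item_to_modify, MIC) for one transaction.

    For every sensitive itemset fully supported by the transaction, the count
    of each of its items is incremented; the item with the maximum count is
    selected. Returns (None, 0) if no sensitive itemset is supported.
    """
    item_count = Counter()
    for itemset in sfi.values():
        if itemset <= transaction:
            item_count.update(itemset)
    if not item_count:
        return None, 0
    mic = max(item_count.values())
    # Tie-breaking by smallest item is an assumption of this sketch.
    item = min(i for i, c in item_count.items() if c == mic)
    return item, mic


t1 = {1, 2, 4, 5, 7}  # transaction 1 of Table 2
print(heuristic(t1, SFI))  # (1, 3)
```

For $t_1$ the sketch selects item '1' with MIC = 3, in agreement with the node $\langle \{1,2,3\}, 3 \rangle$ described for Figure 3.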
Table 6. The MIC and prior weight for each transaction in D

TID   Transaction   $|t_i|$   MIC   w
---   -----------   -------   ---   ----
1     1,2,4,5,7     5         3     3/16
2     1,4,5,7       4         2     2/8
3     1,4,6,7,8     5         1     1/16
4     1,2,5,9       4         1     1/8
5     6,7,8         3         1     1/4




Table 7. The example PWT

Order   TID   w
-----   ---   ----
1       2     2/8
2       5     1/4
3       1     3/16
4       4     1/8
5       3     1/16
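
Tables 6 and 7 can be recomputed from Tables 2 and 4 with the short sketch below, which evaluates equation (4) with exact fractions and then sorts the transactions by decreasing weight. The names `mic` and `prior_weight` are illustrative. Transactions 2 and 5 have the same weight (2/8 = 1/4); keeping the lower TID first is an assumption that happens to match Table 7.

```python
from collections import Counter
from fractions import Fraction

# Database D (Table 2) and sensitive frequent itemsets (Table 4).
D = {
    1: {1, 2, 4, 5, 7},
    2: {1, 4, 5, 7},
    3: {1, 4, 6, 7, 8},
    4: {1, 2, 5, 9},
    5: {6, 7, 8},
}
SFI = [{1, 2, 5}, {1, 4, 7}, {1, 5, 7}, {6, 8}]


def mic(transaction, sfi):
    """Maximum number of supported sensitive itemsets sharing a single item."""
    item_count = Counter()
    for itemset in sfi:
        if itemset <= transaction:
            item_count.update(itemset)
    return max(item_count.values(), default=0)


def prior_weight(transaction, sfi):
    """Equation (4): w = MIC / 2^(|t| - 1), as an exact fraction."""
    m = mic(transaction, sfi)
    if m == 0:
        return Fraction(0)
    return Fraction(m, 2 ** (len(transaction) - 1))


weights = {tid: prior_weight(t, SFI) for tid, t in D.items()}

# PWT: TIDs in decreasing order of w. Python's sort is stable, so
# transactions with equal weight keep their TID order.
pwt = sorted(weights, key=weights.get, reverse=True)
print(pwt)  # [2, 5, 1, 4, 3]
```

The resulting order 2, 5, 1, 4, 3 is the order of Table 7.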





Table 8. Experiment results for $|SFI| = 5$

$|D|$    CPU time (ms)   $|U|$   $|U'|$   #loss itemsets   #modified entries
------   -------------   -----   ------   --------------   -----------------
5000     326.6           439     428.6    5.4              143
10000    454.2           417     406.4    5.6              307.2
15000    701             426     415.6    5.4              513
20000    905             442     431      6                711.6
25000    1183.6          432     421.2    5.8              902.8
30000    1502            443     432.4    5.6              863.8



Now, we use the following example to illustrate the proposed FHSFI algorithm.

Example 1. Let D and SFI be as shown in Tables 2 and 4, and let min_support = 40%. As shown in Table 5,
the support count for each $SFI.t_j$ can be obtained from D and SFI. For example, $SFI.t_2 = \{1,4,7\}$ is supported by
$t_1$, $t_2$, and $t_3$, so $\|SFI.t_2\| = 3$. Table 6 lists the length, MIC, and prior weight of each transaction in the
database. The PWT, shown in Table 7, is obtained by sorting Table 6 in decreasing order of w. Then
the first transaction in the PWT, i.e., $t_2$, is chosen to be modified. According to the heuristic shown in Figure 4, the
item '1' in $t_2$ is removed. Hence, $\|SFI.t_2\|$ and $\|SFI.t_3\|$ are each reduced by 1. $SFI.t_3$ is removed from SFI
because $\|SFI.t_3\| / |D| = 1/5 <$ min_support. The process is repeated until SFI is empty. In the end, the FHSFI
algorithm removes the item '1' in $t_2$, the item '6' or '8' in $t_5$ (selected randomly), and the item '1' in $t_1$. Now all
sensitive frequent itemsets in SFI have been hidden. ■
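
Example 1 can be traced with the following minimal Python sketch of the two stages. It is a reading of the description in this section, not the authors' code: the function names, the dictionary-based PWT (a plain lookup of the maximum weight instead of a sorted table), and the deterministic tie-breaking (lowest TID, smallest item) are all assumptions made for illustration, whereas the paper selects randomly between items '6' and '8'.

```python
from collections import Counter
from fractions import Fraction


def heuristic(transaction, sfi):
    """Figure 4: return (item to delete, MIC), or (None, 0) if nothing is supported."""
    item_count = Counter()
    for itemset in sfi.values():
        if itemset <= transaction:
            item_count.update(itemset)
    if not item_count:
        return None, 0
    mic = max(item_count.values())
    return min(i for i, c in item_count.items() if c == mic), mic


def prior_weight(transaction, sfi):
    """Equation (4): w = MIC / 2^(|t| - 1)."""
    _, mic = heuristic(transaction, sfi)
    return Fraction(mic, 2 ** (len(transaction) - 1)) if mic else Fraction(0)


def fhsfi(db, sfi, min_support):
    """Return (released database D', list of (TID, deleted item))."""
    db = {tid: set(t) for tid, t in db.items()}  # work on a copy of D
    sfi = {j: frozenset(s) for j, s in sfi.items()}
    n = len(db)

    # Stage 1: a single pass collects support counts and prior weights.
    count = dict.fromkeys(sfi, 0)
    weight = {}
    for tid, t in db.items():
        for j, s in sfi.items():
            if s <= t:
                count[j] += 1
        weight[tid] = prior_weight(t, sfi)
    # Guard: itemsets that are not frequent need no hiding.
    sfi = {j: s for j, s in sfi.items() if Fraction(count[j], n) >= min_support}

    # Stage 2: modify transactions in decreasing order of prior weight.
    removed = []
    while sfi:
        tid = max(weight, key=weight.get)  # ties: first TID in insertion order
        item, _ = heuristic(db[tid], sfi)
        if item is None:
            if weight[tid] == 0:
                break  # no transaction can contribute any more
            weight[tid] = Fraction(0)  # stale weight, nothing left to hide here
            continue
        for j, s in list(sfi.items()):
            if item in s and s <= db[tid]:
                count[j] -= 1
                if Fraction(count[j], n) < min_support:
                    del sfi[j]  # this itemset is now hidden
        db[tid].discard(item)
        removed.append((tid, item))
        weight[tid] = prior_weight(db[tid], sfi)
    return db, removed


D = {1: {1, 2, 4, 5, 7}, 2: {1, 4, 5, 7}, 3: {1, 4, 6, 7, 8}, 4: {1, 2, 5, 9}, 5: {6, 7, 8}}
SFI = {1: {1, 2, 5}, 2: {1, 4, 7}, 3: {1, 5, 7}, 4: {6, 8}}

released, removed = fhsfi(D, SFI, Fraction(2, 5))
print(removed)  # [(2, 1), (5, 6), (1, 1)]
```

Under these tie-breaking assumptions the sketch deletes item '1' from $t_2$, item '6' from $t_5$, and item '1' from $t_1$, the same sequence as in Example 1, after which every itemset of Table 4 is supported by at most one transaction of D'.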

IV. PERFORMANCE EVALUATION

We performed our experiments on a notebook with a 1.5 GHz processor and 512 MB of memory,
running the Windows XP operating system. The IBM data generator [11] was used to synthesize the databases for the
experiments. Databases with 5K, 10K, 15K, 20K, 25K, and 30K transactions were generated for the series of
experiments. In each generated database, the average transaction length is 10 and the number of items is 50.
The minimum support threshold is 30%. The experimental results are averaged over 5
independent trials with different SFIs.

The performance of the FHSFI algorithm has been measured according to three criteria: CPU time
requirements, side effects produced, and the number of entries modified. Tables 8 and 9 present the
experimental results for $|SFI| = 5$ and $|SFI| = 10$, respectively.

The CPU time requirements, side-effect evaluation, and the number of entries modified for varied $|D|$
and $|SFI|$ are shown in Figures 6, 7, and 8, respectively.

Table 9. Experiment results for $|SFI| = 10$

Figure 6. CPU time requirements

Figure 7. The side-effect evaluation

Figure 8. The number of entries modified

The experimental results for FHSFI can be summarized as follows:

• As shown in Figure 6, the CPU time grows linearly with the size of the database and scales with the size
of SFI.

• The number of loss itemsets is independent of the size of the database but linearly related to the size of
SFI, as can be seen in Figure 7.

• The number of modified entries depends on the size of the database and the size of SFI. However, since
the heuristic procedures are used to determine the order of modifications, we can observe in Figure 8 that
only a small part of the transactions in the database are modified. For $|D| = 10000$, only 600 transactions are
modified to completely hide the 10 itemsets in SFI.
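
As a sanity check on the reported side effects, the loss-itemset column of Table 8 can be compared against the definition $U - U' - SFI$ from Section II. Because FHSFI only deletes items, no new frequent itemsets can appear, so $U' \subseteq U$; with all SFI hidden and $SFI \subseteq U$, the number of loss itemsets should equal $|U| - |U'| - |SFI|$. The sketch below applies this to the averaged values of Table 8; the helper name `expected_loss` is hypothetical.

```python
import math

# Rows of Table 8: (|D|, CPU time in ms, |U|, |U'|, #loss itemsets, #modified entries).
TABLE_8 = [
    (5000, 326.6, 439, 428.6, 5.4, 143),
    (10000, 454.2, 417, 406.4, 5.6, 307.2),
    (15000, 701, 426, 415.6, 5.4, 513),
    (20000, 905, 442, 431, 6, 711.6),
    (25000, 1183.6, 432, 421.2, 5.8, 902.8),
    (30000, 1502, 443, 432.4, 5.6, 863.8),
]
SFI_SIZE = 5  # Table 8 reports the runs with |SFI| = 5


def expected_loss(u, u_prime, sfi_size):
    """|U - U' - SFI| when SFI and U' are disjoint subsets of U."""
    return u - u_prime - sfi_size


for size, _, u, u_prime, loss, _ in TABLE_8:
    ok = math.isclose(expected_loss(u, u_prime, SFI_SIZE), loss, abs_tol=1e-9)
    print(size, "consistent" if ok else "inconsistent")
```

All six rows satisfy the identity, so the reported loss itemsets are consistent with the reported sizes of U and U'.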

V. CONCLUSIONS AND FURTHER WORK 

In this paper, we have presented the FHSFI algorithm for fast hiding of sensitive frequent itemsets
with limited side effects. The correlations between the sensitive itemsets and each transaction in the original
database are analyzed, and a heuristic function that assigns a prior weight to each transaction is given. The order
in which transactions are modified can be decided efficiently from these weights. This reduces the
time spent on transactions whose modification does not help to hide the given sensitive frequent
itemsets. In other words, the number of transactions in D that have to be processed is also reduced.

Our approach achieves the following goals: 1) all SFI can be completely hidden without
generating all frequent itemsets; 2) limited side effects are generated; 3) any minimum support threshold is
allowed; and 4) only one database scan is required. One of our goals in this research is to hide all SFI with
limited side effects, but our algorithm still causes some loss itemsets. We are currently considering extensions
of the algorithm to address this problem. Another direction is to apply the ideas introduced in this paper to the fast
hiding of sensitive association rules. These issues could be studied in the future.